SEA-HELM

(Southeast Asian Holistic Evaluation of Language Models)

SEA-HELM evaluates large language models across a range of tasks, with an emphasis on Southeast Asian (SEA) languages. The leaderboard covers four capability areas: chat proficiency in SEA languages, instruction-following in SEA languages, SEA linguistic tasks, and performance on a suite of English tasks.

SEA Overall

Scores are averages over 30 bootstraps, shown with 95% confidence intervals (CI).

Filters: model size ≤200B; open instruct models only.
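
As a rough illustration of what "average of 30 bootstraps" with a 95% CI can mean, the sketch below resamples a list of per-example scores with replacement 30 times and reports the mean of the resampled means. The page does not say how its interval is computed, so the normal-approximation half-width used here is an assumption, and `bootstrap_mean_ci` and `example_scores` are hypothetical names and data, not part of SEA-HELM.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_boot=30, seed=0):
    """Return (mean of bootstrap means, 95% CI half-width).

    Assumption: the half-width is a normal approximation on the mean of
    the bootstrap means. The leaderboard may use a different method.
    """
    rng = random.Random(seed)
    n = len(scores)
    boot_means = []
    for _ in range(n_boot):
        # Draw n scores with replacement and record their mean.
        resample = rng.choices(scores, k=n)
        boot_means.append(statistics.fmean(resample))
    center = statistics.fmean(boot_means)
    half_width = 1.96 * statistics.stdev(boot_means) / (n_boot ** 0.5)
    return center, half_width


# Hypothetical per-example scores on a 0-100 scale.
example_scores = [100.0, 0.0, 100.0, 100.0, 50.0, 100.0, 0.0, 100.0]
center, half_width = bootstrap_mean_ci(example_scores)
print(f"{center:.2f}±{half_width:.2f}")
```

A fixed seed keeps the sketch reproducible; a percentile interval over the bootstrap means would be an equally reasonable reading of the caption.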

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       72.02±0.03
2     AISG       SEA-LION v4.5 (Qwen)  27B       69.97±0.05
3     Alibaba    Qwen 3.5              122B MoE  68.40±0.06
4     Alibaba    Qwen 3.5              27B       68.34±0.07
5     Google     Gemma 4               26B MoE   68.18±0.03

Performance for each SEA Language

Burmese

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       67.29±0.08
2     Google     Gemma 4               26B MoE   61.39±0.12
3     AISG       SEA-LION v4.5 (Qwen)  27B       61.11±0.16
4     Alibaba    Qwen 3.5              27B       59.35±0.19
5     Alibaba    Qwen 3.5              122B MoE  59.23±0.20

Filipino

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       76.32±0.08
2     Google     Gemma 4               26B MoE   72.00±0.09
3     AISG       SEA-LION v4.5 (Qwen)  27B       71.99±0.15
4     Alibaba    Qwen 3.5              122B MoE  71.13±0.18
5     DeepSeek   DeepSeek V4 Flash     158B      71.08±0.19

Indonesian

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       77.36±0.07
2     AISG       SEA-LION v4.5 (Qwen)  27B       74.51±0.13
3     Alibaba    Qwen 3.5              27B       74.38±0.14
4     Alibaba    Qwen 3.5              122B MoE  74.18±0.12
5     Alibaba    Qwen 3.6              27B       73.00±0.15

Malay

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       70.89±0.09
2     Alibaba    Qwen 3.5              122B MoE  69.77±0.13
3     AISG       SEA-LION v4.5 (Qwen)  27B       69.69±0.10
4     Alibaba    Qwen 3.5              27B       69.61±0.20
5     Google     Gemma 4               26B MoE   68.76±0.08

Tamil

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       75.41±0.16
2     AISG       SEA-LION v4.5 (Qwen)  27B       75.00±0.14
3     Alibaba    Qwen 3.5              27B       72.77±0.18
4     Alibaba    Qwen 3.5              122B MoE  72.23±0.14
5     Google     Gemma 4               26B MoE   72.20±0.16

Thai

Rank  Developer  Model                 Size      Score (±95% CI)
1     AISG       SEA-LION v4.5 (Qwen)  27B       66.42±0.11
2     Alibaba    Qwen 3.6              27B       63.24±0.19
3     Alibaba    Qwen 3.5              27B       63.08±0.13
4     Google     Gemma 4               31B       62.35±0.07
5     Alibaba    Qwen 3 VL             32B       61.74±0.10

Vietnamese

Rank  Developer  Model                 Size      Score (±95% CI)
1     Google     Gemma 4               31B       74.51±0.08
2     AISG       SEA-LION v4.5 (Qwen)  27B       71.08±0.13
3     Alibaba    Qwen 3.5              122B MoE  71.05±0.19
4     Google     Gemma 4               26B MoE   69.80±0.11
5     Alibaba    Qwen 3.5              27B       69.55±0.15
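
The page does not state how SEA Overall is derived from the per-language scores. For the two models that appear in all seven language lists, the figures above are consistent with an unweighted mean of the seven language scores; the sketch below checks that arithmetic. This is an observation from the listed numbers, not a documented aggregation rule, and `unweighted_mean` is a hypothetical helper.

```python
from statistics import fmean

# Per-language scores copied from the lists above (point estimates only).
per_language = {
    "Gemma 4 31B": {
        "Burmese": 67.29, "Filipino": 76.32, "Indonesian": 77.36,
        "Malay": 70.89, "Tamil": 75.41, "Thai": 62.35, "Vietnamese": 74.51,
    },
    "SEA-LION v4.5 (Qwen) 27B": {
        "Burmese": 61.11, "Filipino": 71.99, "Indonesian": 74.51,
        "Malay": 69.69, "Tamil": 75.00, "Thai": 66.42, "Vietnamese": 71.08,
    },
}

# SEA Overall scores as listed at the top of the page.
reported_overall = {
    "Gemma 4 31B": 72.02,
    "SEA-LION v4.5 (Qwen) 27B": 69.97,
}


def unweighted_mean(scores_by_language):
    """Average the language scores with equal weight (assumed aggregation)."""
    return fmean(scores_by_language.values())


for model, scores in per_language.items():
    computed = unweighted_mean(scores)
    gap = abs(computed - reported_overall[model])
    print(f"{model}: computed {computed:.2f}, reported {reported_overall[model]:.2f}, gap {gap:.3f}")
```

Other models cannot be checked this way from the page alone, because each list shows only its top five entries.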
